EDA for the investing subreddit

Setup

# Setup - Run only once per Kernel App
%conda install https://anaconda.org/conda-forge/openjdk/11.0.1/download/linux-64/openjdk-11.0.1-hacce0ff_1021.tar.bz2

# install PySpark
%pip install pyspark==3.4.0

# restart the kernel so the newly installed Java runtime is picked up
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Retrieving notices: ...working... done

Downloading and Extracting Packages:


## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - conda-forge/openjdk/11.0.1/download/linux-64::openjdk==11.0.1=hacce0ff_1021


The following NEW packages will be INSTALLED:

  openjdk            conda-forge/openjdk/11.0.1/download/linux-64::openjdk-11.0.1-hacce0ff_1021 



Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Note: you may need to restart the kernel to use updated packages.
Collecting pyspark==3.4.0
  Using cached pyspark-3.4.0-py2.py3-none-any.whl
Requirement already satisfied: py4j==0.10.9.7 in /opt/conda/lib/python3.11/site-packages (from pyspark==3.4.0) (0.10.9.7)
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0
Note: you may need to restart the kernel to use updated packages.

Building the Spark session

# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)
Warning: Ignoring non-Spark config property: fs.s3a.aws.credentials.provider
Ivy Default Cache set to: /home/sagemaker-user/.ivy2/cache
The jars for the packages stored in: /home/sagemaker-user/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b02b0b31-b2cc-4ff4-a0ef-5bb8148d9c77;1.0
    confs: [default]
    found org.apache.hadoop#hadoop-aws;3.2.2 in central
    found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 190ms :: artifacts dl 6ms
    :: modules in use:
    com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
    org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-b02b0b31-b2cc-4ff4-a0ef-5bb8148d9c77
    confs: [default]
    0 artifacts copied, 2 already retrieved (0kB/5ms)
24/12/13 02:15:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
:: loading settings :: url = jar:file:/opt/conda/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
3.4.0
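The `Ignoring non-Spark config property` warning above appears because `fs.s3a.aws.credentials.provider` is not a Spark property name; Spark forwards any `spark.hadoop.*` property to the underlying Hadoop configuration, so the key needs that prefix to take effect. A minimal sketch of the corrected settings:

```python
# Corrected S3A credential settings: the bare "fs.s3a..." key must carry
# the "spark.hadoop." prefix so Spark passes it through to Hadoop.
s3a_conf = {
    "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.2.2",
    "spark.hadoop.fs.s3a.aws.credentials.provider":
        "com.amazonaws.auth.ContainerCredentialsProvider",
}
```

Each pair would be applied with `.config(key, value)` on the builder, exactly as in the cell above, and the startup warning disappears.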

Fetching the data from S3

%%time
import sagemaker
from sagemaker.spark.processing import PySparkProcessor

# Set up the PySpark processor to run the job. Note the instance type and
# instance count parameters: SageMaker will launch this many instances of
# this type for the Spark job.
role = sagemaker.get_execution_role()
spark_processor = PySparkProcessor(
    base_job_name="sm-spark-project",
    framework_version="3.3",
    role=role,
    instance_count=4,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

# s3 paths
session = sagemaker.Session()
bucket = session.default_bucket()
s3_dataset_path_comments = "s3://bigdatateaching/reddit-project/reddit/parquet/comments/yyyy=*/mm=*/*.parquet"
s3_dataset_path_submissions = "s3://bigdatateaching/reddit-project/reddit/parquet/submissions/yyyy=*/mm=*/*.parquet"
output_prefix_data = "project"
output_prefix_logs = "spark_logs"

# modify this comma separated list to choose the subreddits of interest
#subreddits = "technology,chatgpt"
subreddits = "investing,cryptocurrency"
    
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
CPU times: user 2.51 s, sys: 325 ms, total: 2.84 s
Wall time: 4.14 s
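Downstream, the comma-separated `subreddits` string is easiest to consume as a list (for example, when passing job arguments); a small sketch:

```python
# Split the comma-separated subreddit string into a clean list
subreddits = "investing,cryptocurrency"
subreddit_list = [s.strip() for s in subreddits.split(",")]
# → ['investing', 'cryptocurrency']
```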
%%time
s3_path = f"s3a://{bucket}/{output_prefix_data}/comments"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
reading comments from s3a://sagemaker-us-east-1-381491950264/project/comments
shape of the comments dataframe is 3,877,393x17
CPU times: user 103 ms, sys: 50.1 ms, total: 153 ms
Wall time: 3min 58s
24/12/13 02:16:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
comments.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: double (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- retrieved_on: long (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
# display a subset of columns
comments.select("subreddit", "author", "body", "parent_id", "link_id", "id", "created_utc").show()
+--------------+--------------------+--------------------+----------+----------+-------+-----------+
|     subreddit|              author|                body| parent_id|   link_id|     id|created_utc|
+--------------+--------------------+--------------------+----------+----------+-------+-----------+
|CryptoCurrency|       TexasBoyz-713|Dude I wouldn’t s...|t1_jsjxgdf|t3_153et67|jsjxjpu| 1689744125|
|CryptoCurrency|         bthemonarch|Moons are the rea...|t1_jsjxcqm|t3_153et67|jsjxjuv| 1689744128|
|CryptoCurrency|        keithwee0909|Should I buy some...|t1_jsjxfk8|t3_153et67|jsjxk0f| 1689744130|
|CryptoCurrency|           [deleted]|           [removed]|t3_153et67|t3_153et67|jsjxk5v| 1689744133|
|CryptoCurrency|       Traveler-0854|¢0.63\n\nWe are o...|t3_153et67|t3_153et67|jsjxk5x| 1689744133|
|CryptoCurrency|       Illicitterror|    Prime Trustusbro|t3_153axby|t3_153axby|jsjxk66| 1689744133|
|CryptoCurrency|              vip887|Yup, a tiny mista...|t3_153ky0e|t3_153ky0e|jsjxk85| 1689744134|
|CryptoCurrency|         chintokkong|New Yorkers will ...|t3_153et67|t3_153et67|jsjxkfo| 1689744137|
|CryptoCurrency|       AutoModerator|\nHere is a [Nitt...|t1_jsjxkfo|t3_153et67|jsjxkgp| 1689744138|
|CryptoCurrency|           ziggyzago|I’m having crazy ...|t3_153et67|t3_153et67|jsjxkk4| 1689744140|
|CryptoCurrency|                R4ID|with how fast it ...|t3_153lm7j|t3_153lm7j|jsjxkkr| 1689744140|
|CryptoCurrency|           RedBunery|> Cathie Wood ass...|t3_153ip93|t3_153ip93|jsjxks6| 1689744143|
|CryptoCurrency|            ccModBot|Hello!\n\nDirect ...|t3_153lm7j|t3_153lm7j|jsjxl1m| 1689744148|
|CryptoCurrency|CoolCoolPapaOldSkool|Yellow for me for...|t3_153lm7j|t3_153lm7j|jsjxl24| 1689744148|
|CryptoCurrency|       Agile_Ad_7061|No. Prince from N...|t3_153ky0e|t3_153ky0e|jsjxl4z| 1689744150|
|CryptoCurrency|           Frogmangy|Literally know a ...|t1_jsjvuyx|t3_153ky0e|jsjxl5i| 1689744150|
|CryptoCurrency|  ImNotFromThisWorld|Man I should have...|t3_153et67|t3_153et67|jsjxlx6| 1689744163|
|CryptoCurrency|          send_tacoz|Aren't Doge price...|t3_153krw1|t3_153krw1|jsjxmgi| 1689744172|
|CryptoCurrency|      OddIndication4|Ask yourself the ...|t3_1539jks|t3_1539jks|jsjxmh7| 1689744172|
|CryptoCurrency|             chili21|           !gas nova|t3_12pf9pb|t3_12pf9pb|jsjxmpc| 1689744176|
+--------------+--------------------+--------------------+----------+----------+-------+-----------+
only showing top 20 rows
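`created_utc` is stored as Unix epoch seconds. As a quick standard-library sanity check, the first timestamp in the output above decodes to mid-July 2023:

```python
from datetime import datetime, timezone

# Decode the first created_utc value shown above (epoch seconds, UTC)
ts = datetime.fromtimestamp(1689744125, tz=timezone.utc)
print(ts.isoformat())  # → 2023-07-19T05:22:05+00:00
```

Inside Spark, the analogous conversion would use `from_unixtime` from `pyspark.sql.functions`.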
%%time
s3_path = f"s3a://{bucket}/{output_prefix_data}/submissions"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
reading submissions from s3a://sagemaker-us-east-1-381491950264/project/submissions
shape of the submissions dataframe is 205,720x21
CPU times: user 12.9 ms, sys: 11.8 ms, total: 24.7 ms
Wall time: 38.8 s
submissions.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- edited: double (nullable = true)
 |-- id: string (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- quarantine: boolean (nullable = true)
 |-- retrieved_on: long (nullable = true)
 |-- score: long (nullable = true)
 |-- selftext: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
# display a subset of columns
submissions.select("subreddit", "author", "title", "selftext", "num_comments", "created_utc").show()
+--------------+-------------------+--------------------+--------------------+------------+-----------+
|     subreddit|             author|               title|            selftext|num_comments|created_utc|
+--------------+-------------------+--------------------+--------------------+------------+-----------+
|     investing|       WichitaFlyer|Will negative pop...|Millennial and ge...|           0| 1689714390|
|CryptoCurrency|            V11c7or|            AITrader|           [removed]|           0| 1689714454|
|     investing| Ok_Supermarket9812|Tools to assess f...|           [removed]|           1| 1689714522|
|CryptoCurrency|    charlythesecond|Terraform Labs Co...|                    |           0| 1689714532|
|CryptoCurrency|    badfishbeefcake|Scam Alert: Celsi...|           [removed]|           2| 1689714536|
|CryptoCurrency|   Fun-Juggernautyy|Opinions on Hatom...|           [removed]|           1| 1689714660|
|CryptoCurrency|   Fun-Juggernautyy|Opinions on Hatom...|           [removed]|           1| 1689714664|
|CryptoCurrency|     OneThatNoseOne|Today, many peopl...|Safe to say that ...|           0| 1689714869|
|     investing| Ok_Supermarket9812|Tool to assess fu...|I am looking for ...|           0| 1689714939|
|CryptoCurrency|          [deleted]|SUI, its hype and...|           [removed]|           1| 1689715044|
|CryptoCurrency|         garchmodel|if arkham was rea...|           [removed]|           1| 1689715090|
|CryptoCurrency|    Fun_Problem_914|hola mrbeast no m...|           [removed]|           1| 1689715093|
|CryptoCurrency|          syndoms18|SUI, its hype and...|           [removed]|           1| 1689715103|
|CryptoCurrency|        RassuEst112|Are There Any Non...|           [removed]|           1| 1689715260|
|     investing|            StophJS|Your ETF portfoli...|This may be flagg...|           0| 1689715325|
|CryptoCurrency|   DrHunterThompson| MOONS Mass Adoption|           [removed]|           1| 1689715452|
|CryptoCurrency|           bdiggles|Where can U.S. tr...|           [removed]|           1| 1689715500|
|CryptoCurrency|        Oneiros1999|Prime Trust Put I...|                    |           0| 1689715584|
|CryptoCurrency|Artorias_the_hollow|I bought some Moo...|           [removed]|           1| 1689715619|
|CryptoCurrency|  Nearby_Market3054|Bitcoin mining an...|           [removed]|           1| 1689715639|
+--------------+-------------------+--------------------+--------------------+------------+-----------+
only showing top 20 rows
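Many `selftext` values above are `[removed]` or empty, which matters for any downstream text analysis. A hypothetical stdlib sketch of the kind of filter one would apply (the marker set and variable names are illustrative, not from the notebook):

```python
# Placeholder markers Reddit substitutes for unavailable post text
REMOVED_MARKERS = {"[removed]", "[deleted]", ""}

selftexts = ["I am looking for ...", "[removed]", "", "[deleted]"]
usable = [s for s in selftexts if s not in REMOVED_MARKERS]
# → ['I am looking for ...']
```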

Submissions data overview and quality check

%%time
s3_path = f"s3a://{bucket}/{output_prefix_data}/submissions"
print(f"reading submissions from {s3_path}")
submissions = spark.read.parquet(s3_path)
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
reading submissions from s3a://sagemaker-us-east-1-381491950264/project/submissions
shape of the submissions dataframe is 205,720x21
CPU times: user 16.3 ms, sys: 2.75 ms, total: 19 ms
Wall time: 26.5 s
submissions.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- edited: double (nullable = true)
 |-- id: string (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- quarantine: boolean (nullable = true)
 |-- retrieved_on: long (nullable = true)
 |-- score: long (nullable = true)
 |-- selftext: string (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
invest_sub = submissions.filter(submissions.subreddit == "investing")
invest_sub.show()
+--------------------+----------------------+-----------------+-----------+-------------+--------------+------+-------+-------+------+------------+-------+----------+------------+-----+--------------------+--------+---------+------------+--------------------+--------------------+
|              author|author_flair_css_class|author_flair_text|created_utc|distinguished|        domain|edited|     id|is_self|locked|num_comments|over_18|quarantine|retrieved_on|score|            selftext|stickied|subreddit|subreddit_id|               title|                 url|
+--------------------+----------------------+-----------------+-----------+-------------+--------------+------+-------+-------+------+------------+-------+----------+------------+-----+--------------------+--------+---------+------------+--------------------+--------------------+
|        WichitaFlyer|                  null|             null| 1689714390|         null|self.investing|  null|153af75|   true| false|           0|  false|     false|  1689714416|    1|Millennial and ge...|   false|investing|    t5_2qhhq|Will negative pop...|https://www.reddi...|
|  Ok_Supermarket9812|                  null|             null| 1689714522|         null|self.investing|  null|153ah5n|   true| false|           1|  false|     false|  1689714551|    1|           [removed]|   false|investing|    t5_2qhhq|Tools to assess f...|https://www.reddi...|
|  Ok_Supermarket9812|                  null|             null| 1689714939|         null|self.investing|  null|153anlg|   true| false|           0|  false|     false|  1689714967|    1|I am looking for ...|   false|investing|    t5_2qhhq|Tool to assess fu...|https://www.reddi...|
|             StophJS|                  null|             null| 1689715325|         null|self.investing|  null|153atef|   true| false|           0|  false|     false|  1689715376|    1|This may be flagg...|   false|investing|    t5_2qhhq|Your ETF portfoli...|https://www.reddi...|
|         MrSilk13642|                  null|             null| 1689716124|         null|self.investing|  null|153b5km|   true| false|           0|  false|     false|  1689716152|    1|So I recently mad...|   false|investing|    t5_2qhhq|Why does my margi...|https://www.reddi...|
|   Chocolatechip2021|                  null|             null| 1689717578|         null|self.investing|  null|153bryb|   true| false|           1|  false|     false|  1689717610|    1|           [removed]|   false|investing|    t5_2qhhq|Where should I mo...|https://www.reddi...|
|   Chocolatechip2021|                  null|             null| 1689717830|         null|self.investing|  null|153bw79|   true| false|           0|  false|     false|  1689717854|    1|I deposited 5k in...|   false|investing|    t5_2qhhq|Where should I mo...|https://www.reddi...|
|          gamerz0111|                  null|             null| 1689719377|         null|self.investing|  null|153ckgh|   true| false|           0|  false|     false|  1689719398|    1|\*I'm currently r...|   false|investing|    t5_2qhhq|Is Zim Integrated...|https://www.reddi...|
|             Akhasic|                  null|             null| 1689719684|         null|self.investing|  null|153cp6x|   true| false|           1|  false|     false|  1689719712|    1|           [removed]|   false|investing|    t5_2qhhq|Demotez.com for s...|https://www.reddi...|
|   Chocolatechip2021|                  null|             null| 1689720385|         null|self.investing|  null|153d01a|   true| false|           0|  false|     false|  1689720417|    1|I am new to the i...|   false|investing|    t5_2qhhq|Which investment ...|https://www.reddi...|
|           stutunaru|                  null|             null| 1689721920|         null|self.investing|  null|153dno6|   true| false|           1|  false|     false|  1689721944|    1|           [removed]|   false|investing|    t5_2qhhq|Looking for a sma...|https://www.reddi...|
|         frogview123|                  null|             null| 1689722111|         null|self.investing|  null|153dqcx|   true| false|           0|  false|     false|  1689722139|    1|I recently starte...|   false|investing|    t5_2qhhq|Something like Un...|https://www.reddi...|
|           rodroprez|                  null|             null| 1689722389|         null|self.investing|  null|153dui4|   true| false|           1|  false|     false|  1689722413|    1|           [removed]|   false|investing|    t5_2qhhq|I'm 41 and have 1...|https://www.reddi...|
|whosthedoginthisscen|                  null|             null| 1689722416|         null|self.investing|  null|153dux6|   true| false|           0|  false|     false|  1689722440|    1|For instance, I c...|   false|investing|    t5_2qhhq|Is there a source...|https://www.reddi...|
|    ConsequencesFree|                  null|             null| 1689722976|         null|self.investing|  null|153e2wi|   true| false|           0|  false|     false|  1689722999|    1|How is it legal t...|   false|investing|    t5_2qhhq|When covered call...|https://www.reddi...|
|            one1jynx|                  null|             null| 1689725323|         null|self.investing|  null|153f06e|   true| false|           1|  false|     false|  1689725353|    1|           [removed]|   false|investing|    t5_2qhhq|I have 149 on Web...|https://www.reddi...|
|  Wide_Pineapple6373|                  null|             null| 1689726807|         null|self.investing|  null|153fkjn|   true| false|           1|  false|     false|  1689726824|    1|           [removed]|   false|investing|    t5_2qhhq|20 Yr old portfol...|https://www.reddi...|
|            NickSS38|                  null|             null| 1689726889|         null|self.investing|  null|153flmp|   true| false|           0|  false|     false|  1689726912|    1|Hi everyone.  I h...|   false|investing|    t5_2qhhq|Advice Needed: $8...|https://www.reddi...|
|       DinkyDoodle69|                  null|             null| 1689728018|         null|self.investing|  null|153g076|   true| false|           0|  false|     false|  1689728041|    1|Walt Disney Corpo...|   false|investing|    t5_2qhhq|As market hits ne...|https://www.reddi...|
|       DinkyDoodle69|                  null|             null| 1689728068|         null|self.investing|  null|153g0vn|   true| false|           0|  false|     false|  1689728085|    1|Walt Disney Corpo...|   false|investing|    t5_2qhhq|As market hits ne...|https://www.reddi...|
+--------------------+----------------------+-----------------+-----------+-------------+--------------+------+-------+-------+------+------------+-------+----------+------------+-----+--------------------+--------+---------+------------+--------------------+--------------------+
only showing top 20 rows
print(f"shape of the submissions dataframe is {invest_sub.count():,}x{len(invest_sub.columns)}")
shape of the submissions dataframe is 43,041x21
from pyspark.sql.functions import col, when, count, isnan
from pyspark.sql.types import DoubleType, FloatType

# count nulls per column (and NaNs for floating-point columns)
null_counts = invest_sub.select([
    count(when(col(c).isNull() | (col(c).cast(DoubleType()).isNotNull() & isnan(col(c))), c)).alias(c) 
    if invest_sub.schema[c].dataType in [DoubleType(), FloatType()]
    else count(when(col(c).isNull(), c)).alias(c)
    for c in invest_sub.columns
])

null_counts.show()
+------+----------------------+-----------------+-----------+-------------+------+------+---+-------+------+------------+-------+----------+------------+-----+--------+--------+---------+------------+-----+---+
|author|author_flair_css_class|author_flair_text|created_utc|distinguished|domain|edited| id|is_self|locked|num_comments|over_18|quarantine|retrieved_on|score|selftext|stickied|subreddit|subreddit_id|title|url|
+------+----------------------+-----------------+-----------+-------------+------+------+---+-------+------+------------+-------+----------+------------+-----+--------+--------+---------+------------+-----+---+
|     0|                 43041|            43041|          0|        43041|     0| 42776|  0|      0|     0|           0|      0|         0|           0|    0|       0|       0|        0|           0|    0|  0|
+------+----------------------+-----------------+-----------+-------------+------+------+---+-------+------+------------+-------+----------+------------+-----+--------+--------+---------+------------+-----+---+
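Relative to the 43,041 rows counted earlier, the flair and `distinguished` columns are entirely null and `edited` nearly so. Converting the counts to percentages makes this explicit; a plain-Python sketch using the figures from the table above:

```python
# Null percentages for the mostly-null columns in the table above
row_count = 43041
null_counts = {
    "author_flair_css_class": 43041,
    "author_flair_text": 43041,
    "distinguished": 43041,
    "edited": 42776,
}
null_pct = {c: round(100 * n / row_count, 1) for c, n in null_counts.items()}
# edited → 99.4; the other three → 100.0
```

These columns carry essentially no information for this subreddit and are safe to ignore in the analysis.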
from pyspark.sql.functions import length

filtered_invest_sub = invest_sub.filter(length("title") >= 3)

shortest_titles = filtered_invest_sub.select("title") \
                        .withColumn("title_length", length("title")) \
                        .orderBy("title_length") \
                        .limit(20)

shortest_titles.show(truncate=False)
+-------------------------+------------+
|title                    |title_length|
+-------------------------+------------+
|Investing lesson for kids|25          |
|Captive Insurance Program|25          |
|Yieldmax Has Cult Losers |25          |
|Investing 150K in Europe?|25          |
|Simple Wash Sell Question|25          |
|Where to start to end 9-5|25          |
|Working to setup taxable |25          |
|$500 TO INVEST PER MONTH!|25          |
|Sell ESPP to max out ROTH|25          |
|Richtech Robotics class B|25          |
|What happened to biotech?|25          |
|401k trading and BTC ETFs|25          |
|What am I actually doing?|25          |
|Wealthsimple and Fidelity|25          |
|Needing investment advice|25          |
|What should I do with 5k?|25          |
|Honest question of "Why?"|25          |
|Bonds for 60/40 Portfolio|25          |
|Investing trends question|25          |
|Strategy for buying stock|25          |
+-------------------------+------------+
print(f"shape of the investing submissions dataframe is {filtered_invest_sub.count():,}x{len(filtered_invest_sub.columns)}")
shape of the investing submissions dataframe is 43,041x21
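The shortest-titles query above is a standard filter-then-sort-by-length pattern; the same idea in plain Python for reference (titles here are illustrative):

```python
titles = ["OK", "GME?", "Is Zim Integrated a buy?", "$500 TO INVEST PER MONTH!"]

# keep titles of at least 3 characters, then order by length
shortest = sorted((t for t in titles if len(t) >= 3), key=len)
# → ['GME?', 'Is Zim Integrated a buy?', '$500 TO INVEST PER MONTH!']
```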
removed_count = invest_sub.filter(invest_sub.title == "[removed]").count()

print(f"Number of '[removed]' in title: {removed_count}")
Number of '[removed]' in title: 0
invest_sub = filtered_invest_sub

Comments data overview and quality check

%%time
s3_path = f"s3a://{bucket}/{output_prefix_data}/comments"
print(f"reading comments from {s3_path}")
comments = spark.read.parquet(s3_path)
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")
reading comments from s3a://sagemaker-us-east-1-381491950264/project/comments
shape of the comments dataframe is 3,877,393x17
CPU times: user 109 ms, sys: 25.9 ms, total: 135 ms
Wall time: 3min 19s
comments.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: long (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: double (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- retrieved_on: long (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)
invest_comments = comments.filter(comments.subreddit == "investing")
invest_comments.show()
+--------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+------+------+-------+----------+----------+------------+-----+--------+---------+------------+
|              author|author_flair_css_class|author_flair_text|                body|controversiality|created_utc|distinguished|edited|gilded|     id|   link_id| parent_id|retrieved_on|score|stickied|subreddit|subreddit_id|
+--------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+------+------+-------+----------+----------+------------+-----+--------+---------+------------+
|    Alfred_Love_Song|                  null|             null|Not sure about bi...|               0| 1689744177|         null|  null|     0|jsjxms6|t3_153kjyt|t1_jsjwmaz|  1689744200|    1|   false|investing|    t5_2qhhq|
|    Alfred_Love_Song|                  null|             null|    Whats a small %?|               0| 1689744195|         null|  null|     0|jsjxnv8|t3_153kjyt|t1_jsjwyi4|  1689744216|    1|   false|investing|    t5_2qhhq|
|Waste-Temperature626|                  null|             null|> hedge against h...|               0| 1689744409|         null|  null|     0|jsjy02y|t3_153kjyt|t1_jsjwyi4|  1689744428|    1|   false|investing|    t5_2qhhq|
|      ChengSkwatalot|                  null|             null|All else equal, a...|               0| 1689744548|         null|  null|     0|jsjy7tz|t3_153af75|t3_153af75|  1689744576|    1|   false|investing|    t5_2qhhq|
|           ImamTrump|                  null|             null|No,because the US...|               0| 1689744870|         null|  null|     0|jsjypf1|t3_153af75|t3_153af75|  1689744894|    1|   false|investing|    t5_2qhhq|
|           Rooflife1|                  null|             null|I always heard “p...|               0| 1689745044|         null|  null|     0|jsjyyy4|t3_153e2wi|t3_153e2wi|  1689745074|    1|   false|investing|    t5_2qhhq|
|             Zanna-K|                  null|             null|Productivity can'...|               0| 1689745048|         null|  null|     0|jsjyz69|t3_153af75|t1_jsjml5s|  1689745074|    1|   false|investing|    t5_2qhhq|
|         MrSilk13642|                  null|             null|Oh ok sounds good...|               0| 1689745216|         null|  null|     0|jsjz8j0|t3_153b5km|t1_jsjv74i|  1689745240|    1|   false|investing|    t5_2qhhq|
|         bungholio99|                  null|             null|Most banks are bu...|               0| 1689745221|         null|  null|     0|jsjz8v6|t3_153kjyt|t3_153kjyt|  1689745240|    1|   false|investing|    t5_2qhhq|
|        Vast_Cricket|                  null|             null|There are high yi...|               0| 1689745290|         null|  null|     0|jsjzcjh|t3_153j8sp|t3_153j8sp|  1689745311|    1|   false|investing|    t5_2qhhq|
|         JenerousJew|                  null|             null|Price it in somet...|               0| 1689745353|         null|  null|     0|jsjzfwf|t3_152lv6c|t3_152lv6c|  1689745371|    1|   false|investing|    t5_2qhhq|
|  T_Dizzle_My_Nizzle|                  null|             null|This will likely ...|               0| 1689745469|         null|  null|     0|jsjzm7w|t3_153af75|t3_153af75|  1689745493|    1|   false|investing|    t5_2qhhq|
|              mrclut|                  null|             null|I don't have much...|               0| 1689745743|         null|  null|     0|jsk00nh|t3_153af75|t3_153af75|  1689745766|    1|   false|investing|    t5_2qhhq|
|        Bates_master|                  null|             null|get it while it's...|               0| 1689745787|         null|  null|     0|jsk02ve|t3_153af75|t3_153af75|  1689745814|    1|   false|investing|    t5_2qhhq|
|            ddlJunky|                  null|             null|It's inevitable t...|               0| 1689745843|         null|  null|     0|jsk05ve|t3_152lv6c|t1_jsj8cau|  1689745866|    1|   false|investing|    t5_2qhhq|
|        MidKnight148|                  null|             null|I'm not concerned...|               0| 1689745891|         null|  null|     0|jsk08ie|t3_153af75|t3_153af75|  1689745912|    1|   false|investing|    t5_2qhhq|
|        YamiChan1016|                  null|             null|No, you misunders...|               0| 1689746298|         null|  null|     0|jsk0twr|t3_152lv6c|t1_jsk05ve|  1689746320|    1|   false|investing|    t5_2qhhq|
|        YamiChan1016|                  null|             null|Wine or art is ju...|               0| 1689746342|         null|  null|     0|jsk0wa9|t3_152lv6c|t1_jsk05ve|  1689746367|    1|   false|investing|    t5_2qhhq|
|           dCrumpets|                  null|             null|The republicans c...|               0| 1689746393|         null|  null|     0|jsk0yxk|t3_153af75|t1_jsj7uro|  1689746422|    1|   false|investing|    t5_2qhhq|
|           fraylovze|                  null|             null|Are you just hold...|               0| 1689746440|         null|  null|     0|jsk11f7|t3_15357vy|t3_15357vy|  1689746470|    1|   false|investing|    t5_2qhhq|
+--------------------+----------------------+-----------------+--------------------+----------------+-----------+-------------+------+------+-------+----------+----------+------------+-----+--------+---------+------------+
only showing top 20 rows
from pyspark.sql.functions import col, count, when, isnan
from pyspark.sql.types import DoubleType, FloatType

# Calculate null counts for each column
comments_null_counts = invest_comments.select([
    count(when(col(c).isNull() | (col(c).cast(DoubleType()).isNotNull() & isnan(col(c))), c)).alias(c) 
    if invest_comments.schema[c].dataType in [DoubleType(), FloatType()]
    else count(when(col(c).isNull(), c)).alias(c)
    for c in invest_comments.columns
])

comments_null_counts.show()
+------+----------------------+-----------------+----+----------------+-----------+-------------+------+------+---+-------+---------+------------+-----+--------+---------+------------+
|author|author_flair_css_class|author_flair_text|body|controversiality|created_utc|distinguished|edited|gilded| id|link_id|parent_id|retrieved_on|score|stickied|subreddit|subreddit_id|
+------+----------------------+-----------------+----+----------------+-----------+-------------+------+------+---+-------+---------+------------+-----+--------+---------+------------+
|     0|                516482|           516482|   0|               0|          0|       485737|507049|     0|  0|      0|        0|           0|    0|       0|        0|           0|
+------+----------------------+-----------------+----+----------------+-----------+-------------+------+------+---+-------+---------+------------+-----+--------+---------+------------+
from pyspark.sql.functions import length
filtered_invest_comments = invest_comments.filter(length("body") >= 3)

shortest_comments = filtered_invest_comments.select("body") \
                        .withColumn("body_length", length("body")) \
                        .orderBy("body_length") \
                        .limit(20)

shortest_comments.show(truncate=False)
+----+-----------+
|body|body_length|
+----+-----------+
|VTI |3          |
|ISO |3          |
|Lol |3          |
|Lol |3          |
|Lol |3          |
|idk |3          |
|VTI |3          |
|Oof |3          |
|Lol |3          |
|Yes |3          |
|Yep |3          |
|XOM |3          |
|Yup |3          |
|??? |3          |
|Lol |3          |
|lol |3          |
|VOO |3          |
|Yes |3          |
|VGT |3          |
|VTI |3          |
+----+-----------+
print(f"shape of the comments dataframe is {filtered_invest_comments.count():,}x{len(filtered_invest_comments.columns)}")
shape of the comments dataframe is 515,133x17
comments_removed_count = invest_comments.filter(invest_comments.body == "[removed]").count()

print(f"Number of '[removed]' in body: {comments_removed_count}")
Number of '[removed]' in body: 7083
invest_comments = filtered_invest_comments.filter(filtered_invest_comments.body != "[removed]")
print(f"shape of the invest_comments dataframe is {invest_comments.count():,}x{len(invest_comments.columns)}")
shape of the invest_comments dataframe is 508,050x17

EDA Questions and Analysis

Business goal 1

Determine the most common words and topics in the investing subreddit.

Technical proposal

from pyspark.sql.functions import col, lower, regexp_replace, explode, split
from pyspark.ml.feature import StopWordsRemover

# Step 1: Text Preprocessing
# Convert 'body' text to lowercase and remove punctuation
invest_comments_clean = invest_comments.withColumn(
    "cleaned_body", 
    lower(regexp_replace(col("body"), "[^a-zA-Z\\s]", ""))
)

# Split text into words
invest_comments_words = invest_comments_clean.withColumn(
    "words", split(col("cleaned_body"), "\\s+")
)

# Remove stop words using Spark's StopWordsRemover
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
invest_comments_filtered = remover.transform(invest_comments_words)

# Step 2: Word Frequency Analysis
# Explode the 'filtered_words' column to count word occurrences
words_df = invest_comments_filtered.select(explode(col("filtered_words")).alias("word"))

# Filter out any empty strings that might appear
words_df = words_df.filter(col("word") != "")

# Count the frequency of each word
word_counts = words_df.groupBy("word").count().orderBy(col("count").desc())

# Show the top 20 most common words
word_counts.show(20)
+-------------+-----+
|         word|count|
+-------------+-----+
|         like|92144|
|        money|85407|
|       please|74849|
|            k|71039|
|         dont|68591|
|       market|67559|
|        years|59518|
|          get|56161|
|           im|53827|
|        think|52949|
|         time|49489|
|         make|48734|
|       people|48451|
|         year|46358|
|automatically|46324|
|      removed|46290|
|         post|45833|
|         also|45677|
|      account|44345|
|          one|43235|
+-------------+-----+
only showing top 20 rows
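Several of the top words ("please", "automatically", "removed", "post", "account") read like AutoModerator boilerplate rather than organic discussion. A minimal sketch of excluding bot authors before counting, shown in plain Python on a toy sample; in PySpark the same idea is a single `invest_comments.filter(col("author") != "AutoModerator")` before the word split:

```python
import re
from collections import Counter

# Toy sample standing in for the comments DataFrame.
comments = [
    {"author": "AutoModerator", "body": "Please read the rules. Your post was automatically removed."},
    {"author": "alice", "body": "VTI and chill, the market rewards patience."},
    {"author": "bob", "body": "Market timing rarely beats time in the market."},
]

word_counts = Counter()
for c in comments:
    if c["author"] == "AutoModerator":  # drop bot boilerplate before counting
        continue
    # Same cleaning as above: lowercase, strip punctuation, split on whitespace
    words = re.sub(r"[^a-zA-Z\s]", "", c["body"].lower()).split()
    word_counts.update(words)

print(word_counts.most_common(3))
```

With the bot comments excluded, the frequency list reflects what subscribers actually wrote.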
comments_word_counts_top50 = word_counts.limit(50).toPandas()

comments_word_counts_top50.to_csv("comments_word_counts_top50.csv", index=False)
import matplotlib.pyplot as plt

# Converting the Spark DataFrame to Pandas
top_words = word_counts.limit(20).toPandas()

# Plot the top 20 most common words
plt.figure(figsize=(12, 6))
plt.bar(top_words['word'], top_words['count'], alpha=0.7)
plt.title('Top 20 Most Common Words in Investing Subreddit', fontsize=16)
plt.xlabel('Words', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.tight_layout()
plt.show()

Business goal 2

Identify trending topics and key investment terms in Reddit discussions to inform content strategy and product development for financial services.

Technical proposal

!pip install wordcloud
Collecting wordcloud
  Using cached wordcloud-1.9.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Requirement already satisfied: numpy>=1.6.1 in /opt/conda/lib/python3.11/site-packages (from wordcloud) (1.26.4)
Requirement already satisfied: pillow in /opt/conda/lib/python3.11/site-packages (from wordcloud) (10.4.0)
Requirement already satisfied: matplotlib in /opt/conda/lib/python3.11/site-packages (from wordcloud) (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (4.54.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (24.1)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (3.2.0)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.11/site-packages (from matplotlib->wordcloud) (2.9.0)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib->wordcloud) (1.16.0)
Using cached wordcloud-1.9.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (547 kB)
Installing collected packages: wordcloud
Successfully installed wordcloud-1.9.4
from wordcloud import WordCloud
import re
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import nltk

# Download stopwords if not already available
nltk.download('stopwords')

# Build a single text corpus from the 'body' column in invest_comments
# (note: .collect() pulls every comment body to the driver, so this only suits moderate data sizes)
text = " ".join(invest_comments.select('body').rdd.flatMap(lambda x: x).collect())
text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation

# Define and remove common stopwords
stop_words = set(stopwords.words('english'))
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stop_words, colormap='viridis').generate(text)

# Plot and save the word cloud (save before show, since show() clears the current figure)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Popular Terms in Investing Subreddit Comments')
plt.savefig('wordcloud.png')
plt.show()
[nltk_data] Downloading package stopwords to /home/sagemaker-
[nltk_data]     user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

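Collecting every comment body to the driver will not scale past EDA-sized data. Since the goal-1 `word_counts` DataFrame already holds per-word frequencies, WordCloud can be fed a `{word: count}` dict via its `generate_from_frequencies` method instead of one giant string. A sketch of building that dict, with a small literal standing in for the Spark result:

```python
# With the goal-1 `word_counts` DataFrame this dict would come from something like:
#   freqs = dict(word_counts.limit(200).rdd.map(lambda r: (r["word"], r["count"])).collect())
# Here a small literal stands in for that aggregated result.
rows = [("market", 67559), ("money", 85407), ("years", 59518)]
freqs = {word: count for word, count in rows}

# WordCloud accepts this dict directly, skipping the raw-text corpus entirely:
#   WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(freqs)
print(max(freqs, key=freqs.get))
```

This keeps the heavy aggregation in Spark and moves only a few hundred (word, count) pairs to the driver.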

Business goal 3

Determine whether certain events lead to increased discussion volume.

Technical proposal

!pip install seaborn
Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in /opt/conda/lib/python3.11/site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in /opt/conda/lib/python3.11/site-packages (from seaborn) (2.2.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in /opt/conda/lib/python3.11/site-packages (from seaborn) (3.9.2)
Requirement already satisfied: contourpy>=1.0.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.0)
Requirement already satisfied: cycler>=0.10 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.54.1)
Requirement already satisfied: kiwisolver>=1.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.1)
Requirement already satisfied: pillow>=8 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.0)
Requirement already satisfied: python-dateutil>=2.7 in /opt/conda/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2023.3)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.2)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Using cached seaborn-0.13.2-py3-none-any.whl (294 kB)
Installing collected packages: seaborn
Successfully installed seaborn-0.13.2
from pyspark.sql.functions import from_unixtime, weekofyear, hour, avg
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Convert created_utc to datetime and extract week_of_the_year and hour_of_the_day
invest_comments = invest_comments.withColumn("datetime", from_unixtime(invest_comments.created_utc))
invest_comments = invest_comments.withColumn("week_of_the_year", weekofyear("datetime"))
invest_comments = invest_comments.withColumn("hour_of_the_day", hour("datetime"))

# Group by week_of_the_year and hour_of_the_day and calculate average score
engagement_data = invest_comments.groupBy("week_of_the_year", "hour_of_the_day").agg(avg("score").alias("average_score"))

# Convert to Pandas for visualization
engagement_pd = engagement_data.toPandas()

# Pivot the data for heatmap
heatmap_data = engagement_pd.pivot(index="week_of_the_year", columns="hour_of_the_day", values="average_score")

# Plot the engagement heatmap
plt.figure(figsize=(15, 10))
sns.heatmap(heatmap_data, cmap='viridis', annot=False, fmt='.2f', linewidths=0.5, cbar=True)
plt.title('Average Comment Score in the Investing Subreddit by Hour and Week of the Year', fontsize=16)
plt.xlabel('Hour of the Day (0-23)', fontsize=14)
plt.ylabel('Week of the Year', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.tight_layout()  # Ensure everything fits nicely
plt.show()
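The heatmap above aggregates average comment score, while this goal targets discussion volume; counting comments per (week, hour) cell measures volume more directly. A minimal pandas sketch with synthetic timestamps (in PySpark the change amounts to swapping `avg("score")` for a `count`):

```python
import pandas as pd

# Synthetic comment timestamps standing in for created_utc.
ts = pd.to_datetime(
    ["2023-07-19 05:00", "2023-07-19 05:30", "2023-07-19 14:00", "2023-07-26 05:10"]
)
df = pd.DataFrame({"datetime": ts})
df["week_of_the_year"] = df["datetime"].dt.isocalendar().week
df["hour_of_the_day"] = df["datetime"].dt.hour

# Count comments per (week, hour) cell; in PySpark this is
#   invest_comments.groupBy("week_of_the_year", "hour_of_the_day").count()
volume = (
    df.groupby(["week_of_the_year", "hour_of_the_day"])
      .size()
      .reset_index(name="comment_count")
)
print(volume)
```

Pivoting `volume` the same way as `engagement_pd` yields a heatmap of raw discussion volume, which is the quantity that should spike around events.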

Business goal 4

Determine which stocks are most frequently mentioned on Reddit to gauge general interest and popularity.

Technical proposal

# Extracting Stock Tickers from Reddit Data 

from pyspark.sql.functions import regexp_extract, col

# Defining a regular expression to match candidate tickers: 1-5 capital letters
# Note: this also matches ordinary capitalized words like "I" or "US",
# so expect false positives; a stricter pattern (e.g., requiring a "$" prefix) helps
ticker_regex = r"\b[A-Z]{1,5}\b"

# Extracting the first candidate ticker from each comment's 'body'
# (regexp_extract returns only the first match per row)
potential_tickers = invest_comments.withColumn("ticker", regexp_extract(col("body"), ticker_regex, 0))

# Filtering non-empty tickers
valid_tickers = potential_tickers.filter(col("ticker") != "")

# Counting mentions of each ticker
ticker_counts = valid_tickers.groupBy("ticker").count().orderBy(col("count").desc())

# Showing the most mentioned tickers
ticker_counts.show(20)
+------+------+
|ticker| count|
+------+------+
|     I|180388|
|     A|  7143|
|   IRA|  6795|
|    US|  5708|
|     S|  4823|
|    OP|  3990|
|   VOO|  3343|
|   ETF|  3053|
|  HYSA|  2392|
|    AI|  2347|
|   VTI|  2250|
|   DCA|  1449|
|   SPY|  1383|
|    CD|  1296|
|   BTC|  1205|
|  NVDA|  1091|
|     T|  1062|
|   SEC|  1015|
|    DM|  1011|
|   WSB|   872|
+------+------+
only showing top 20 rows
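The output confirms the false-positive problem: "I", "A", and "US" dominate the counts. A hedged sketch of a stricter pattern that requires a literal `$` prefix (cashtags like `$AAPL`), demonstrated with Python's `re` on a sample comment; the same pattern drops into `regexp_extract`, and `re.findall` corresponds to extracting every match per comment (Spark SQL's `regexp_extract_all`, available via `expr` in Spark 3.1+):

```python
import re

# Require a literal '$' prefix so plain words like "I" or "US" don't match.
cashtag_regex = r"\$([A-Z]{1,5})\b"

body = "I think $AAPL and $NVDA beat the US market, but VOO is safer."
tickers = re.findall(cashtag_regex, body)
print(tickers)
```

The trade-off is that cashtags miss bare mentions like "VOO" in the sample above; combining the stricter regex with a whitelist of known symbols (as in the yfinance step below) covers both cases.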
!pip install yfinance
Collecting yfinance
  Using cached yfinance-0.2.50-py2.py3-none-any.whl.metadata (5.5 kB)
Requirement already satisfied: pandas>=1.3.0 in /opt/conda/lib/python3.11/site-packages (from yfinance) (2.2.3)
Requirement already satisfied: numpy>=1.16.5 in /opt/conda/lib/python3.11/site-packages (from yfinance) (1.26.4)
Requirement already satisfied: requests>=2.31 in /opt/conda/lib/python3.11/site-packages (from yfinance) (2.32.3)
Collecting multitasking>=0.0.7 (from yfinance)
  Using cached multitasking-0.0.11-py3-none-any.whl.metadata (5.5 kB)
Requirement already satisfied: lxml>=4.9.1 in /opt/conda/lib/python3.11/site-packages (from yfinance) (5.3.0)
Requirement already satisfied: platformdirs>=2.0.0 in /opt/conda/lib/python3.11/site-packages (from yfinance) (3.11.0)
Requirement already satisfied: pytz>=2022.5 in /opt/conda/lib/python3.11/site-packages (from yfinance) (2023.3)
Requirement already satisfied: frozendict>=2.3.4 in /opt/conda/lib/python3.11/site-packages (from yfinance) (2.4.6)
Collecting peewee>=3.16.2 (from yfinance)
  Using cached peewee-3.17.8-cp311-cp311-linux_x86_64.whl
Requirement already satisfied: beautifulsoup4>=4.11.1 in /opt/conda/lib/python3.11/site-packages (from yfinance) (4.12.3)
Collecting html5lib>=1.1 (from yfinance)
  Using cached html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Requirement already satisfied: soupsieve>1.2 in /opt/conda/lib/python3.11/site-packages (from beautifulsoup4>=4.11.1->yfinance) (2.5)
Requirement already satisfied: six>=1.9 in /opt/conda/lib/python3.11/site-packages (from html5lib>=1.1->yfinance) (1.16.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.11/site-packages (from html5lib>=1.1->yfinance) (0.5.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /opt/conda/lib/python3.11/site-packages (from pandas>=1.3.0->yfinance) (2.9.0)
Requirement already satisfied: tzdata>=2022.7 in /opt/conda/lib/python3.11/site-packages (from pandas>=1.3.0->yfinance) (2024.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.11/site-packages (from requests>=2.31->yfinance) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.11/site-packages (from requests>=2.31->yfinance) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/conda/lib/python3.11/site-packages (from requests>=2.31->yfinance) (1.26.19)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.11/site-packages (from requests>=2.31->yfinance) (2024.8.30)
Using cached yfinance-0.2.50-py2.py3-none-any.whl (102 kB)
Using cached html5lib-1.1-py2.py3-none-any.whl (112 kB)
Using cached multitasking-0.0.11-py3-none-any.whl (8.5 kB)
Installing collected packages: peewee, multitasking, html5lib, yfinance
Successfully installed html5lib-1.1 multitasking-0.0.11 peewee-3.17.8 yfinance-0.2.50
import pandas as pd
import plotly.express as px
import yfinance as yf
from pyspark.sql.functions import col, regexp_extract, count

# Step 1: Extract Stock Tickers and Mentions from Reddit Data
def extract_ticker_mentions(reddit_df, ticker_regex):
    """
    Extract and count stock tickers mentioned in Reddit posts.
    :param reddit_df: PySpark DataFrame with Reddit comments/submissions
    :param ticker_regex: Regex pattern to identify stock tickers (e.g., capital letters 1-5 characters)
    :return: Pandas DataFrame with tickers and their mention counts
    """

    tickers_df = (
        reddit_df
        .withColumn('ticker', regexp_extract(col('body'), ticker_regex, 0))
        .filter(col('ticker') != "")
        .groupBy('ticker')
        .agg(count('*').alias('mention_count'))
        .orderBy(col('mention_count').desc())
        .limit(50)  # Top 50 tickers
        .toPandas()
    )
    return tickers_df

ticker_regex = r'\b[A-Z]{1,5}\b'
reddit_df = spark.read.parquet(s3_path)  # s3_path points to the Reddit data loaded earlier in the notebook
top_tickers = extract_ticker_mentions(reddit_df, ticker_regex)
# Step 2: Fetch Market Data with Company Names from Yahoo Finance
def fetch_market_data_yahoo(tickers):
    """
    Fetch market data including company names, market capitalization, and outstanding shares.
    :param tickers: List of stock tickers.
    :return: DataFrame with ticker, company name, market_cap, and trade_count.
    """
    market_data = []
    
    for ticker in tickers:
        try:
            stock = yf.Ticker(ticker)
            info = stock.info
            market_data.append({
                "ticker": ticker,
                "company_name": info.get("shortName", ticker),  # Use ticker if name is unavailable
                "market_cap": info.get("marketCap", 0),  # Market Capitalization
                "trade_count": info.get("sharesOutstanding", 0)  # Outstanding Shares
            })
        except Exception as e:
            print(f"Error fetching data for {ticker}: {e}")
    
    return pd.DataFrame(market_data)

market_data = fetch_market_data_yahoo(top_tickers['ticker'].tolist())

# Step 3: Merge Data
merged_data = pd.merge(top_tickers, market_data, on='ticker', how='inner')

# Remove invalid or zero values
merged_data = merged_data[
    (merged_data['market_cap'] > 0) & 
    (merged_data['trade_count'] > 0) &
    (merged_data['mention_count'] > 0)
]

def create_interactive_bubble_plot(data, save_as="html"):
    """
    Create an interactive bubble plot of ticker mentions vs. market cap and save it.
    :param data: Pandas DataFrame with 'mention_count', 'market_cap', 'trade_count', and 'company_name'
    :param save_as: Format to save the plot ('png' or 'html')
    """
    # Combine ticker and company name for display
    data['label'] = data['ticker'] + " - " + data['company_name']
    
    # Create the figure
    fig = px.scatter(
        data,
        x='mention_count',
        y='market_cap',
        size='trade_count',
        color='ticker',
        hover_name='label',  # Show company name and ticker in hover
        size_max=60,
        labels={
            'mention_count': 'Mention Count',
            'market_cap': 'Market Capitalization (USD)',
            'trade_count': 'Shares Outstanding'
        },
        title="Stock Popularity vs Market Cap and Trade Volume",
    )

    # Save the plot
    if save_as == "png":
        fig.write_image("bubble_plot.png")
        print("Plot saved as 'bubble_plot.png'")
    elif save_as == "html":
        fig.write_html("bubble_plot.html")
        print("Plot saved as 'bubble_plot.html'")

    fig.show()

# Example usage
create_interactive_bubble_plot(merged_data, save_as="html")
ERROR:yfinance:404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/DYOR?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=DYOR&crumb=CuJz0cgaUfS
ERROR:yfinance:404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/USDT?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=USDT&crumb=CuJz0cgaUfS
ERROR:yfinance:404 Client Error: Not Found for url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/MATIC?modules=financialData%2CquoteType%2CdefaultKeyStatistics%2CassetProfile%2CsummaryDetail&corsDomain=finance.yahoo.com&formatted=false&symbol=MATIC&crumb=CuJz0cgaUfS
Plot saved as 'bubble_plot.html'
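The 404 errors for DYOR, USDT, and MATIC show that mention frequency alone admits slang and crypto symbols that Yahoo's equity endpoint does not know. Screening candidates against a set of known listed symbols before calling yfinance avoids the wasted requests; a sketch with a small hardcoded whitelist (hypothetical here; in practice the set would come from official exchange symbol listings):

```python
# Hypothetical whitelist; in practice load official exchange symbol lists.
KNOWN_TICKERS = {"AAPL", "NVDA", "SPY", "VOO", "VTI", "XOM", "TSLA"}

candidates = ["VOO", "DYOR", "NVDA", "USDT", "MATIC", "SPY"]

# Keep only symbols that are actually listed before hitting the yfinance API.
valid = [t for t in candidates if t in KNOWN_TICKERS]
skipped = [t for t in candidates if t not in KNOWN_TICKERS]

print(valid)    # these go on to fetch_market_data_yahoo
print(skipped)  # e.g. slang / crypto symbols that 404 on Yahoo's equity endpoint
```

Passing `valid` instead of the raw top-50 list into `fetch_market_data_yahoo` keeps the try/except as a last resort rather than the primary filter.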